Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from utils import *
kakamana
May 13, 2023
In this notebook you will visualize how a decision tree is split using information gain.
This Decision Trees is part of DeepLearning.AI course: Machine Learning Specialization / Course 2: Advanced Learning Algorithms: Decision Trees Learning You will build and train a neural network with TensorFlow to perform multi-class classification in the second course of the Machine Learning Specialization. Ensure that your machine learning models are generalizable by applying best practices for machine learning development. You will build and train a neural network using TensorFlow to perform multi-class classification in the second course of the Machine Learning Specialization. Implement best practices for machine learning development to ensure that your models are generalizable to real-world data and tasks. Create and use decision trees and tree ensemble methods, including random forests and boosted trees.
This is my learning experience of data science through DeepLearning.AI. These repository contributions are part of my learning journey through my graduate program masters of applied data sciences (MADS) at University Of Michigan, DeepLearning.AI, Coursera & DataCamp. You can find my similar articles & more stories at my medium & LinkedIn profile. I am available at kaggle & github blogs & github repos. Thank you for your motivation, support & valuable feedback.
These include projects, coursework & notebook which I learned through my data science journey. They are created for reproducible & future reference purpose only. All source code, slides or screenshot are intellectual property of respective content authors. If you find these contents beneficial, kindly consider learning subscription from DeepLearning.AI Subscription, Coursera, DataCamp
As you saw in the lectures, in a decision tree, we decide if a node will be split or not by looking at the information gain that split would give us. (Image of video IG)
Where
\[\text{Information Gain} = H(p_1^\text{node})- \left(w^{\text{left}}H\left(p_1^\text{left}\right) + w^{\text{right}}H\left(p_1^\text{right}\right)\right),\]
and \(H\) is the entropy, defined as
\[H(p_1) = -p_1 \text{log}_2(p_1) - (1- p_1) \text{log}_2(1- p_1)\]
Remember that log here is defined to be in base 2. Run the code block below to see by yourself how the entropy. \(H(p)\) behaves while \(p\) varies.
Note that the H attains its higher value when \(p = 0.5\). This means that the probability of event is \(0.5\). And its minimum value is attained in \(p = 0\) and \(p = 1\), i.e., the probability of the event happening is totally predictable. Thus, the entropy shows the degree of predictability of an event.
Ear Shape | Face Shape | Whiskers | Cat | |
---|---|---|---|---|
Pointy | Round | Present | 1 | |
Floppy | Not Round | Present | 1 | |
Floppy | Round | Absent | 0 | |
Pointy | Not Round | Present | 0 | |
Pointy | Round | Present | 1 | |
Pointy | Round | Absent | 1 | |
Floppy | Not Round | Absent | 0 | |
Pointy | Round | Absent | 1 | |
Floppy | Round | Absent | 0 | |
Floppy | Round | Absent | 0 |
We will use one-hot encoding to encode the categorical features. They will be as follows:
Therefore, we have two sets:
X_train
: for each example, contains 3 features: - Ear Shape (1 if pointy, 0 otherwise) - Face Shape (1 if round, 0 otherwise) - Whiskers (1 if present, 0 otherwise)
y_train
: whether the animal is a cat - 1 if the animal is a cat - 0 otherwise
This means that the first example has a pointy ear shape, round face shape and it has whiskers.
On each node, we compute the information gain for each feature, then split the node on the feature with the higher information gain, by comparing the entropy of the node with the weighted entropy in the two splitted nodes.
So, the root node has every animal in our dataset. Remember that \(p_1^{node}\) is the proportion of positive class (cats) in the root node. So
\[p_1^{node} = \frac{5}{10} = 0.5\]
Now let’s write a function to compute the entropy.
1.0
To illustrate, let’s compute the information gain if we split the node for each of the features. To do this, let’s write some functions.
def split_indices(X, index_feature):
"""Given a dataset and a index feature, return two lists for the two split nodes, the left node has the animals that have
that feature = 1 and the right node those that have the feature = 0
index feature = 0 => ear shape
index feature = 1 => face shape
index feature = 2 => whiskers
"""
left_indices = []
right_indices = []
for i,x in enumerate(X):
if x[index_feature] == 1:
left_indices.append(i)
else:
right_indices.append(i)
return left_indices, right_indices
So, if we choose Ear Shape to split, then we must have in the left node (check the table above) the indices:
\[0 \quad 3 \quad 4 \quad 5 \quad 7\]
and the right indices, the remaining ones.
Now we need another function to compute the weighted entropy in the splitted nodes. As you’ve seen in the video lecture, we must find:
Note the difference between these two definitions!! To illustrate, if we split the root node on the feature of index 0 (Ear Shape), then in the left node, the one that has the animals 0, 3, 4, 5 and 7, we have:
\[w^{\text{left}}= \frac{5}{10} = 0.5 \text{ and } p^{\text{left}} = \frac{4}{5}\] \[w^{\text{right}}= \frac{5}{10} = 0.5 \text{ and } p^{\text{right}} = \frac{1}{5}\]
def weighted_entropy(X,y,left_indices,right_indices):
"""
This function takes the splitted dataset, the indices we chose to split and returns the weighted entropy.
"""
w_left = len(left_indices)/len(X)
w_right = len(right_indices)/len(X)
p_left = sum(y[left_indices])/len(left_indices)
p_right = sum(y[right_indices])/len(right_indices)
weighted_entropy = w_left * entropy(p_left) + w_right * entropy(p_right)
return weighted_entropy
0.7219280948873623
So, the weighted entropy in the 2 split nodes is 0.72. To compute the Information Gain we must subtract it from the entropy in the node we chose to split (in this case, the root node).
Now, let’s compute the information gain if we split the root node for each feature:
for i, feature_name in enumerate(['Ear Shape', 'Face Shape', 'Whiskers']):
left_indices, right_indices = split_indices(X_train, i)
i_gain = information_gain(X_train, y_train, left_indices, right_indices)
print(f"Feature: {feature_name}, information gain if we split the root node using this feature: {i_gain:.2f}")
Feature: Ear Shape, information gain if we split the root node using this feature: 0.28
Feature: Face Shape, information gain if we split the root node using this feature: 0.03
Feature: Whiskers, information gain if we split the root node using this feature: 0.12
So, the best feature to split is indeed the Ear Shape. Run the code below to see the split in action. You do not need to understand the following code block.
Depth 0, Root: Split on feature: 0
- Left leaf node with indices [0, 3, 4, 5, 7]
- Right leaf node with indices [1, 2, 6, 8, 9]
KeyError: 0
The process is recursive, which means we must perform these calculations for each node until we meet a stopping criteria:
Depth 0, Root: Split on feature: 0
- Depth 1, Left: Split on feature: 1
-- Left leaf node with indices [0, 4, 5, 7]
-- Right leaf node with indices [3]
- Depth 1, Right: Split on feature: 2
-- Left leaf node with indices [1]
-- Right leaf node with indices [2, 6, 8, 9]
KeyError: 0